
Fix OpenMP thread allocation in Linux #5551

Merged: 15 commits into microsoft:master on Nov 29, 2022

Conversation

@svotaw (Contributor) commented Oct 19, 2022

For the streaming push APIs that are designed to work multi-threaded (e.g., LGBM_DatasetPushRowsByCSRWithMetadata), we observed consistent failures on Linux. These APIs rely on pre-allocating batches of sparse push arrays to avoid thread collisions. To do the initial allocation during LGBM_InitStreaming, we were using OMP_NUM_THREADS(). This works fine on Windows, where it was tested.

However, on Linux, it appears that each calling thread keeps its own OpenMP thread statistics: neither OMP_NUM_THREADS() nor omp_max_num_threads() returns the same value across different calling threads.

To fix this, this PR switches to a static MAX_THREAD_ALLOCATION constant and simply pre-allocates based on a known, fixed thread space. This wastes a small amount of memory through over-allocation (empty vectors), but testing confirms that multi-threaded pushing now works on Linux as well.

Note that there are multi-threaded unit tests covering this, but because OpenMP is not supported in the C++ unit test framework, the test threading is implemented manually and hence always succeeds. The issue is only seen when OpenMP is active and we are running on Linux.

Example:

2 external Spark Tasks are calling LGBM_DatasetPushRowsByCSRWithMetadata to load a Dataset in parallel. Before pushing, Spark Task thread 0 calls LGBM_InitStreaming with an expected 2 external calling threads (1 for each Spark Task). OMP_NUM_THREADS() returns 3 threads, so the sparse bins allocate 6 independent push buffers (2 external threads * 3 OpenMP threads per external thread) to avoid thread collisions.

The expected internal_thread_id range of Spark Task thread 0 is 0-2 and the range of Spark Task thread 1 is 3-5, and these are used to index the correct thread-safe sparse push buffer.

However (on Linux specifically), Spark Task thread 1's OMP_NUM_THREADS() returns 4, a different value than what Spark Task thread 0 saw during the initial allocation (3). This results in Spark Task thread 1 calling LGBM_DatasetPushRowsByCSRWithMetadata thinking that its internal_thread_id range is 4-7 (as opposed to 3-5). This causes a JVM crash, since trying to index sparse push buffer 6 or 7 is out of range.

Also observed: even if you add the ability to dynamically add push buffers (to avoid going out of range), you still get indexing collisions. Two external threads can end up using the same internal_thread_id because of the way we calculate it: OMP_NUM_THREADS() is expected to be a constant so that it generates non-overlapping OpenMP tid ranges, but on Linux it seems to vary depending on which thread calls it.
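To make the failure mode concrete, here is a minimal C++ sketch of the allocation and indexing scheme described above (the function and type names are illustrative, not the actual LightGBM code), showing how a per-thread-varying omp_get_max_threads() pushes the buffer index out of range:

    #include <omp.h>
    #include <vector>

    struct PushBuffer { /* per-thread sparse push storage (illustrative) */ };

    // Done once, from one external thread (e.g. Spark Task 0), during InitStreaming:
    // the buffer pool is sized from the OpenMP thread count seen by *that* thread.
    std::vector<PushBuffer> InitBuffers(int num_external_threads) {
      int omp_threads = omp_get_max_threads();  // e.g. 3 in the scenario above
      return std::vector<PushBuffer>(num_external_threads * omp_threads);  // 2 * 3 = 6 buffers
    }

    // Done on every push, from whichever external thread is calling, inside its
    // OpenMP parallel region: map (external thread, OpenMP tid) to a buffer index.
    int InternalThreadId(int external_thread_id) {
      // Each caller re-reads its own OpenMP max-thread count. On Linux this can
      // differ from the value used at allocation time (e.g. 4 instead of 3), so
      // external thread 1 computes ids 4..7 and indexes past the 6 allocated buffers.
      return external_thread_id * omp_get_max_threads() + omp_get_thread_num();
    }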

@svotaw (Contributor, author) commented Nov 8, 2022

@guolinke @shiyu1994 Just checking in for updates. We are awaiting this fix to unblock a release.

@shiyu1994 (Collaborator) left a comment

Thanks for the fix. I just left two quick questions; could you please help address them? I'll get this merged as soon as possible.

include/LightGBM/utils/openmp_wrapper.h (outdated review thread, resolved)
@guolinke (Collaborator) commented

@svotaw will force-setting export OMP_NUM_THREADS=8 fix the problem on Linux?

@svotaw (Contributor, author) commented Nov 10, 2022

@svotaw will force-setting export OMP_NUM_THREADS=8 fix the problem on Linux?

Not sure what you are suggesting. Are you asking whether that would fix it in my particular case, or proposing it as a fix for everyone? I'm not sure we should force a fixed number of threads (and as low as 8), and according to the docs you can override this via the local annotation in some cases. I'm also not sure we should rely on an environment variable being set to fix a bug. Or are you suggesting using the OMP_NUM_THREADS env var first instead of OMP_THREAD_LIMIT?

However, while investigating your question, I noticed the OMP_DYNAMIC env variable, which defaults to true. I will test what happens when it is false. That might be our issue, in the sense that multiple threads may each be allowed to adjust their own maximum OpenMP thread count. If so, maybe that will at least solve it in our particular case for now, but I'd still suggest we find a permanent solution that does not rely on env var settings. Will reply with results.

@guolinke (Collaborator) commented

Okay, I see. Another solution I'd suggest is to dynamically grow the buffer when the tid is larger than the buffer size. But this solution may slow things down due to the additional if-else branching.

@svotaw (Contributor, author) commented Nov 12, 2022

Okay, I see. Another solution I'd suggest is to dynamically grow the buffer when the tid is larger than the buffer size. But this solution may slow things down due to the additional if-else branching.

Correct. I didn't want to adversely affect a shared critical path with an extra comparison, so I would rather avoid a dynamic buffer size. Also, without a fixed size, you can't generate deterministic thread ids across threads, so you end up with thread collisions anyway.

I'm taking a better look at the OpenMP APIs now that I understand the issue better and you've both raised some good points. Should have something up today.

@svotaw (Contributor, author) commented Nov 13, 2022

After a more thorough investigation, here's what I found with OpenMP:

As a reminder, what we need is a constant that represents how many OpenMP threads each external thread will create. With that constant, we can pre-allocate sets of buffers (i.e., 1 set of X buffers for each of Y calling threads; we are given Y and need to find X).

The ideal candidate would be omp_get_max_threads(). However, this returns a different number depending on the calling thread (our test cases get 3 and 4 respectively on 2 external threads). This seems to happen only on Linux, but that could be a coincidence caused by something else. The docs even say that you should use this function if you need to make allocations, but apparently an externally multi-threaded environment wasn't considered. That kind of makes sense, since OpenMP would already be the multi-threading solution for most people.

I tried all the APIs that OpenMP has for tweaking thread counts: OMP_DYNAMIC, OMP_THREAD_LIMIT, OMP_NUM_THREADS, etc.
(Note: OpenMP actually recommends NOT reading env vars directly and using its APIs to access params instead, so although we discussed using env vars above, OpenMP already exposes APIs for all of its params that also take defaults and such into account. A small standalone probe of these calls is sketched after the list below.)

  1. Setting omp_set_dynamic() to either 0 or 1 had no effect. Even with dynamic == 0 (which was actually the default), we get different values for omp_get_max_threads() on each external thread.
  2. omp_get_thread_limit() was not much use. By default it actually returned Max(int32), which is not useful for allocation, and I see no reason for us to set it manually to something like 128 or 1024 either. The docs say the default is implementation dependent.
  3. I used omp_get_active_level() to make sure there wasn't some nesting weirdness somehow, but both external threads were on the same level.
  4. As @guolinke suggested, using omp_set_num_threads(X) or a num_threads(X) clause on the parallel region worked, and omp_get_max_threads() returned a constant X after that. However, I don't like this solution: OpenMP and users should be free to choose the thread count based on hardware and other needs, rather than us hardwiring it, and we might interfere with users setting these values to something else they want.
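For reference, a small standalone probe along these lines (with a plain std::thread standing in for the second external caller) exercises the calls from items 1-4; the exact numbers printed are environment dependent:

    #include <omp.h>
    #include <cstdio>
    #include <thread>

    void Report(const char* who) {
      std::printf("%s: dynamic=%d thread_limit=%d active_level=%d max_threads=%d\n",
                  who, omp_get_dynamic(), omp_get_thread_limit(),
                  omp_get_active_level(), omp_get_max_threads());
    }

    int main() {
      omp_set_dynamic(0);  // item 1: made no difference to the discrepancy we saw
      Report("external thread 0");

      // On Linux, this second caller can report a different max_threads value.
      std::thread second([] { Report("external thread 1"); });
      second.join();

      // item 4: after pinning the count, omp_get_max_threads() returns a constant 8
      // for this thread, but that hardwires a choice users may want to make themselves.
      omp_set_num_threads(8);
      Report("external thread 0 after omp_set_num_threads(8)");
      return 0;
    }

(Build with, e.g., g++ -fopenmp -pthread.)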

Here are my thoughts on a flexible solution that is better than the static constant from the first iteration of this PR:

  1. We create a "new" OpenMP-style param for LightGBM, omp_get_streaming_max_threads(), and default it to some reasonable value like 8 or 16. This should work in the vast majority of cases with no user action needed, and reduces wasted allocations (compared to 128 or even 1024).
  2. For those special cases where this doesn't work (OpenMP has enough resources to create more threads than that), we allow overriding the default with an env var, OMP_MAX_STREAMING_THREADS. This is similar to how OpenMP itself already works, with params and overrides.

See the latest iteration for an example of this suggestion (sketched below). We can still debate details of caching, naming, the location of the env var utils, etc., but this worked in the more extensive testing I did. It works out of the box and is still flexible for corner cases.
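For illustration, here is a minimal sketch of that suggestion; the helper name, the OMP_MAX_STREAMING_THREADS env var, and the default of 16 are simply the values discussed in this thread, not a committed API:

    #include <cstdlib>
    #include <stdexcept>
    #include <string>

    // Hypothetical helper: the maximum number of OpenMP threads assumed per
    // external streaming caller, with an env var override for corner cases.
    inline int GetStreamingMaxThreads() {
      static const int kDefaultStreamingMaxThreads = 16;
      if (const char* env = std::getenv("OMP_MAX_STREAMING_THREADS")) {
        try {
          const int value = std::stoi(env);
          if (value > 0) return value;  // override when more threads are available
        } catch (const std::exception&) {
          // fall through to the default on a malformed value
        }
      }
      return kDefaultStreamingMaxThreads;
    }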

@svotaw (Contributor, author) commented Nov 13, 2022

@guolinke Actually, having looked at it and thought about it for another day, it might be better to choose another way to set this value. Since this isn't really an OpenMP setting, perhaps it doesn't make sense to set it in openmp_wrapper. Also, I don't really like the static var.

I can think of 2 other options:

  1. As an optional argument to LGBM_InitStreaming(), defaulting to 8 or 16, and stored as a private class var in Dataset.
  2. As a config option, defaulting to 8 or 16, and again stored as a private class var in Dataset.

I will implement #1 in another iteration when I get a chance; a rough sketch of what it amounts to is below.
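Roughly, option #1 amounts to something like the following sketch (the argument list and member names are illustrative, not the exact c_api.h / dataset.h code):

    #include <cstdint>
    #include <vector>

    typedef void* DatasetHandle;  // as in c_api.h

    // Sketch of option 1: the streaming init entry point takes the per-caller
    // OpenMP thread budget as an explicit argument, instead of LightGBM querying
    // omp_get_max_threads() from whichever thread happens to call first.
    int LGBM_DatasetInitStreaming(DatasetHandle dataset,
                                  int32_t num_external_threads,
                                  int32_t omp_max_threads);

    // Dataset side: the value is stored as a private member and used to size the
    // sparse push buffers, so every external caller sees the same buffer layout.
    class Dataset {
     public:
      void InitStreaming(int32_t num_external_threads, int32_t omp_max_threads) {
        omp_max_threads_ = omp_max_threads > 0 ? omp_max_threads : 16;  // default of 8 or 16 per the TODO
        push_buffers_.resize(static_cast<size_t>(num_external_threads) * omp_max_threads_);
      }
     private:
      int32_t omp_max_threads_ = 16;
      std::vector<int> push_buffers_;  // stand-in for the per-thread sparse push buffers
    };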

@guolinke (Collaborator) commented

@svotaw Thank you, I agree with #1

@svotaw (Contributor, author) commented Nov 14, 2022

@svotaw Thank you, I agree with #1

@guolinke Done. Ready for final review then.

@svotaw svotaw requested review from shiyu1994 and removed request for guolinke November 14, 2022 23:02
@svotaw (Contributor, author) commented Nov 14, 2022

@guolinke @shiyu1994

I think this is the best fix, although of course you are free to comment. I believe it's ready for check-in, other than your comment on whether we should use 8 or 16 as the default (see the TODO in dataset.h). Note that the failing R test is most likely a flake.

@shiyu1994 (Collaborator) commented

It seems that some cpp tests are failing. Will look into this tomorrow.

src/c_api.cpp (two outdated review threads, resolved)
@svotaw svotaw requested review from guolinke and removed request for shiyu1994 November 20, 2022 06:27
@guolinke (Collaborator) left a comment

Thank you, LGTM!

@guolinke (Collaborator) commented

Let us fix the CI next

@svotaw (Contributor, author) commented Nov 21, 2022

Thanks for the signoff and feedback! Just to be clear, is this a general CI problem, or something related to this PR? (I assume the former, but making sure)

@jameslamb (Collaborator) commented

is this a general CI problem, or something related to this PR

Please update to latest master. If you see any CI failures after doing that, they're likely related to this PR's changes, and you'll need to address them.

To be clear, we never merge PRs in this repo where CI is failing, even if the failures seem unrelated to the PR's changes.

@svotaw (Contributor, author) commented Nov 21, 2022

Sure. I guess I meant: is this something I need to merge a fix for, or something for which the backend pipeline just needs to be re-run? I merged from master.

@svotaw (Contributor, author) commented Nov 21, 2022

After syncing with main, I'm seeing the error below in a few R tests. I will try a re-run to confirm it's real.

The downloaded binary packages are in
C:\Users\runneradmin\AppData\Local\Temp\RtmpgztNMe\downloaded_packages
Downloading https://github.com/microsoft/LightGBM/releases/download/v2.0.12/miktexsetup-5.2.0-x64.zip
Setting up MiKTeX
miktexsetup_standalone: Timeout was reached
miktexsetup_standalone: Data: code="28", url="https://api2.miktex.org/hello"
Error: Process completed with exit code -1.

@svotaw (Contributor, author) commented Nov 22, 2022

I fixed all the tests, except for the R tests, which are failing due to MiKTeX failing somehow. If that's something related to this PR, can someone point me to a possible cause? I don't know anything about MiKTeX.

@jameslamb (Collaborator) commented

except for the R tests, which are failing due to MiKTeX failing somehow

That looks unrelated to this PR's changes, as we're seeing it elsewhere on PRs and master. See #5600.

@svotaw (Contributor, author) commented Nov 29, 2022

@jameslamb Can this be checked in?

@jameslamb changed the title from "fix: Fix OpenMP thread allocation in Linux" to "Fix OpenMP thread allocation in Linux" on Nov 29, 2022
@jameslamb (Collaborator) commented

I'll merge this, based on @guolinke 's approval.

Normally I'd label a change like this, which adds a new required argument to a public API in c_api.h, as a "breaking" change, but I think "fix" is a fine categorization given that LGBM_DatasetInitStreaming() was just introduced in #5299 and has never been part of a LightGBM release.

Thanks for the help!

@jameslamb jameslamb merged commit 4c5d0fb into microsoft:master Nov 29, 2022
@github-actions (bot) commented

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2023